Improving SMT by Using Parallel Data of a Closely Related Language

Authors

Petra Galuščáková

Ondřej Bojar

Abstract

The amount of training data in statistical machine translation critically affects translation quality. In this paper, we demonstrate how to increase translation quality for one language pair by introducing parallel data from a closely related language. Specifically, we improve English→Slovak translation using a large Czech–English parallel corpus and a shallow MT system for Czech→Slovak translation. Several options are explored to identify the best possible configuration. We also present our two contributions to available data resources, namely the English–Slovak parallel corpus and the Slovak variant of the WMT 2011 test set.
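The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the shallow Czech→Slovak system works, so the word-for-word substitution, the toy lexicon `CS_TO_SK`, and the function names `shallow_cs_to_sk`, `pivot_corpus`, and `build_training_data` are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Toy Czech->Slovak lexicon. Hypothetical and far too small for real use;
# it only stands in for the "shallow MT system" named in the abstract.
CS_TO_SK: Dict[str, str] = {"kočka": "mačka", "ještě": "ešte", "být": "byť"}


def shallow_cs_to_sk(sentence: str) -> str:
    """Assumed shallow translation: replace known words, copy the rest."""
    return " ".join(CS_TO_SK.get(token, token) for token in sentence.split())


def pivot_corpus(cs_en_pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (Czech, English) pairs into synthetic (English, Slovak) pairs."""
    return [(en, shallow_cs_to_sk(cs)) for cs, en in cs_en_pairs]


def build_training_data(
    en_sk_genuine: Iterable[Tuple[str, str]],
    cs_en_pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Concatenate genuine English-Slovak data with the pivoted data."""
    return list(en_sk_genuine) + pivot_corpus(cs_en_pairs)


if __name__ == "__main__":
    czech_english = [("kočka ještě spí", "the cat is still sleeping")]
    for en, sk in build_training_data([], czech_english):
        print(en, "|||", sk)
```

Plain concatenation is only one way to combine the two data sources; the abstract mentions that several configurations were explored without naming them.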

Similar resources

Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants associated with RANLP 2013

In this paper we describe the construction of a parallel corpus between the standard and a non-standard language variety, specifically standard Austrian German and Viennese dialect. The resulting parallel corpus is used for statistical machine translation (SMT) from the standard to the non-standard variety. The main challenges to our task are data scarcity and the lack of an authoritative ortho...

A Hybrid Machine Translation System Based on a Monotone Decoder

In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...

Addressing some Issues of Data Sparsity towards Improving English-Manipuri SMT using Morphological Information

The performance of an SMT system heavily depends on the availability of large parallel corpora. Unavailability of these resources in the required amount for many language pairs is a challenging issue. The required size of the resource involving morphologically rich and highly agglutinative language is essentially much more for the SMT systems. This paper investigates on some of the issues on en...

Chained System: A Linear Combination of Different Types of Statistical Machine Translation Systems

The paper explores a way to learn post-editing fixes of raw MT outputs automatically by combining two different types of statistical machine translation (SMT) systems in a linear fashion. Our proposed system (which we call a chained system) consists of two SMT systems: (i) a syntax-based SMT system and (ii) a phrase-based SMT system (Koehn, 2004). We first translate source sentences of the bite...

Building Parallel Corpora for SMT System: A Case Study of English-Manipuri

Statistical Machine Translation (SMT) systems are developed using sentence-aligned parallel corpora. The difficulty is that no parallel corpus of the required size exists for many language pairs. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, the various issues of a quality parallel corpus and a technique that extracts p...

Publication date: 2012
